1 Introduction and Motivation
1.1 Who I Am
1.1.1 Background
- Lead Data Scientist for Statistical Computing at the Urban Institute
- Adjunct Professor in the McCourt School of Public Policy at Georgetown University
- American Statistical Association Traveling Course Instructor
1.1.2 R Projects
- Synthetic data generation (rstudio::conf(2022) talk about library(tidysynthesis))
- Formal privacy/differential privacy evaluation
- Projects that iterate with R Markdown/Quarto
- Manage the Urban Institute ggplot2 theme (Examples) (Code)
- Urban Institute R Users Group
1.2 Who Are You?
- What types of analyses do you develop?
- What is your programming experience?
- What are you most interested to learn?
1.3 Outline
1.3.1 Goals
- Enthusiasm
- Develop a firm foundation with R
- Leave with enough understanding and resources that you can apply the covered material to your own work
- You will still need to look stuff up!
- I will try to give you hints for where to find help
1.3.2 Process
- Please consider turning on your cameras.
- Please ask questions at any time. You can speak up, raise your hand, or drop your question in the chat.
- I need to know how you are doing. Please ask lots of questions and give your reactions.
- I will check in during breaks about pacing and content.
- We will skip some exercises. Don’t worry, I’ve shared solutions to all exercises!
1.4 Content
- Introduction and Motivation
- Grammar of Graphics
- Jon Schwabish’s Five Guidelines for Better Data Visualizations
- Visualizing big data
- Visualizing regression models
- Data munging for visualization
- Visualizing time series data
1.5 Why Data Visualization?
- Data visualization is exploratory data analysis (EDA)
- Data visualization is diagnosis and validation
- Data visualization is communication
1.6 Why ggplot2?
1.6.1 1. Looks good!
library(ggplot2) is used by fivethirtyeight, Financial Times, BBC, the Urban Institute, and more.
1.6.2 2. Flexible and expressive
By breaking data visualization into component parts, library(ggplot2) is a set of building blocks instead of a set of rigid cookie cutters.
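The building-block idea can be sketched as follows. This is a minimal illustration, not an example from the course materials: it assumes the built-in mtcars dataset, and the choice of layers is arbitrary. Each `+` adds one component (data and mappings, geometric layers, a scale, labels, a theme).

```r
library(ggplot2)

# Start with data and aesthetic mappings, then add one building block at a time.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +                                            # layer 1: points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # layer 2: fitted line
  scale_x_continuous(name = "Weight (1,000 lbs)") +         # scale for the x axis
  labs(y = "Miles per gallon") +                            # axis label
  theme_minimal()                                           # overall look

p
```

Swapping or removing any single line changes one part of the chart without touching the rest.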
1.6.3 3. Reproducible and transparent
I believe in a code-first approach to data analysis.
Code maximizes the chance of catching mistakes when they inevitably happen, and it is the clearest way to document and share an analysis.
1.6.4 4. Scalable
It’s almost as easy to make the 100th chart as it is to make the 2nd chart. This allows for iteration.
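One way to see this is to wrap a chart in a function and apply it to many variables. The sketch below is illustrative only: `plot_histogram()` is a hypothetical helper, and the built-in mtcars dataset and the chosen columns are assumptions.

```r
library(ggplot2)

# A hypothetical helper: one histogram for any numeric column of mtcars.
plot_histogram <- function(variable) {
  ggplot(mtcars, aes(x = .data[[variable]])) +
    geom_histogram(bins = 10) +
    labs(x = variable, y = "Count")
}

# Making four charts takes the same code as making one.
variables <- c("mpg", "wt", "hp", "qsec")
plots <- lapply(variables, plot_histogram)
names(plots) <- variables

plots$mpg
```

Adding a fifth chart only requires adding one more name to `variables`.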
1.6.5 5. In my analysis workflow
Data visualization is fundamental to EDA, statistical modeling, and basically any work with data. Too many people find themselves using different tools for data visualization and statistical modeling. R/ggplot2 allows everything to happen in the same script at the same time.
Too often, switching from a programming language to Excel results in parsing errors or cell-reference errors.
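As a rough sketch of modeling and visualization living in one script, the example below fits a linear model and then plots its residuals without leaving R. The dataset (built-in mtcars) and the model are assumptions chosen for illustration.

```r
library(ggplot2)

# Fit a simple linear model.
fit <- lm(mpg ~ wt, data = mtcars)

# Collect fitted values and residuals for diagnosis.
diagnostics <- data.frame(
  fitted = fitted(fit),
  residual = resid(fit)
)

# Plot residuals against fitted values in the same script.
residual_plot <- ggplot(diagnostics, aes(x = fitted, y = residual)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point() +
  labs(x = "Fitted value", y = "Residual")

residual_plot
```

No data are exported or re-imported between the modeling step and the chart, so there is no opportunity for a parsing or cell-reference error.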
1.7 R Markdown
This short course will rely on R Markdown, which is a literate statistical programming framework that combines text and images, code, and code output into output documents like PDFs and web pages. It is like an easier-to-use LaTeX with more flexibility. Instead of .R scripts, we will use .Rmd documents, which have three main components:
- Markdown
- YAML Header
- Code chunks
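To make the three components concrete, the sketch below writes a tiny .Rmd file from R so that each part is visible line by line. The file name, title, and contents are hypothetical, and the file goes to a temporary directory.

```r
# Each element of this character vector is one line of a hypothetical .Rmd file.
rmd_lines <- c(
  "---",                            # YAML header: metadata and output format
  "title: \"A minimal example\"",
  "output: html_document",
  "---",
  "",
  "## A Markdown heading",          # Markdown: formatted text
  "",
  "Plain text with *emphasis*.",
  "",
  "```{r}",                         # Code chunk: R code that runs when knitting
  "summary(mtcars$mpg)",
  "```"
)

# Write the lines to a temporary file; open it in RStudio to inspect the parts.
rmd_path <- file.path(tempdir(), "minimal-example.Rmd")
writeLines(rmd_lines, rmd_path)
```

The YAML header sits between the two `---` lines, the code chunk sits between the lines of three backticks, and everything else is Markdown.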
1.7.1 Running code in documents
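We will mostly run code inside of .Rmd documents. In RStudio, there are three ways to do this:
- Run the current line or selection like a .R script (Ctrl/Cmd + Enter)
- Run the entire current chunk (the green arrow at the top right of the chunk)
- Run all chunks above (the downward-pointing arrow next to the green arrow)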
1.7.2 Knitting documents
More commonly, documents are knitted. This runs all of the code in the .Rmd in a new R session and then creates an output document like a .html or a .pdf. If the code has errors, knitting will fail.
Click the Knit button when a .Rmd document is open in RStudio to knit the document.
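Knitting can also be started from the R console with `rmarkdown::render()`. The sketch below is a minimal illustration under stated assumptions: the document is a hypothetical one-chunk file written to a temporary directory, and the render step is skipped if the rmarkdown package or pandoc is not available.

```r
# Write a hypothetical one-chunk document to a temporary file.
rmd_path <- file.path(tempdir(), "knit-example.Rmd")
writeLines(
  c("---", "title: \"Knit example\"", "output: html_document", "---", "",
    "```{r}", "1 + 1", "```"),
  rmd_path
)

# Knitting needs pandoc (RStudio bundles it); skip if it is missing.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  # render() runs every chunk and returns the path of the output document.
  # Unlike the Knit button, render() evaluates code in the current session
  # by default rather than in a new one.
  html_path <- rmarkdown::render(rmd_path, quiet = TRUE)
}
```

If any chunk raises an error, `render()` stops with that error and no output document is produced.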